Supervised or unsupervised & model types

Peer Herholz (he/him)
Postdoctoral researcher - NeuroDataScience lab at MNI/McGill, UNIQUE
Member - BIDS, ReproNim, Brainhack, Neuromod, OHBM SEA-SIG

@peerherholz

Aim(s) of this section

  • learn about the distinction between supervised & unsupervised machine learning

  • get to know the variety of potential models within each

Outline for this section

  1. supervised vs. unsupervised learning

  2. supervised learning examples

  3. unsupervised learning examples

A brief recap & first overview

  • let’s bring back our rough analysis outline that we introduced in the previous section

  • so far we talked about how a Model (M) can be utilized to obtain information (output) from a certain input

  • the information requested can be manifold but roughly be situated on two broad levels:

    • learning problem

      • supervised or unsupervised

    • specific task type

      • predicting clinical measures, behavior, demographics, other properties

      • segmentation

      • discover hidden structures

      • etc.

https://scikit-learn.org/stable/_static/ml_map.png

Learning problems - supervised vs. unsupervised

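  • if we now also include the task type, we can basically describe things via a 2 x 2 design: supervised learning (classification or regression) and unsupervised learning (clustering or decomposition)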

Our example dataset

Now that we’ve gone through a huge set of definitions and road maps, let’s move away from these rather abstract discussions to the “real deal”, i.e. seeing how these models behave in the wild. For this we’re going to sing the song “Hello example dataset my old friend, I came to apply machine learning to you again”. Just to be sure: we will again use the example dataset we briefly explored in the previous section to showcase how the models we just talked about can be put into action, as well as how they change/affect the questions we can address and how we have to interpret the results.

At first, we’re going to load our input data, i.e. X again:

import numpy as np

data = np.load('MAIN2019_BASC064_subsamp_features.npz')['a']
data.shape
(155, 2016)
  • just as a reminder: what we have in X here is a vectorized connectivity matrix containing 2016 features, which constitute the correlations between brain region-specific time courses, for each of 155 samples (participants)

  • as before, we can visualize our X to inspect it and maybe get a first idea if there might be something going on

import plotly.express as px
from IPython.display import display, HTML
from plotly.offline import init_notebook_mode, plot

# aspect='auto' lets the image fill the figure instead of forcing square pixels
fig = px.imshow(data, labels=dict(x="features", y="participants"), height=800, aspect='auto')

fig.update(layout_coloraxis_showscale=False)
init_notebook_mode(connected=True)

#fig.show()

plot(fig, filename = 'input_data.html')
display(HTML('input_data.html'))
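  • to make the 2016 features a bit less abstract, here is a minimal sketch of how a vectorized connectivity matrix can be built. It uses random time courses, not the workshop data, and assumes 64 brain regions (an assumption based on the "064" in the file name), which gives 64 x 63 / 2 = 2016 unique region pairs. All variable names are made up for illustration:

```python
import numpy as np

rs = np.random.RandomState(0)

# Synthetic stand-in for one participant: 100 time points x 64 regions.
# (Assumption: 64 regions, so that the feature count matches 2016.)
n_timepoints, n_regions = 100, 64
time_courses = rs.standard_normal((n_timepoints, n_regions))

# Region-by-region correlation matrix (64 x 64, symmetric, ones on the diagonal).
connectivity = np.corrcoef(time_courses, rowvar=False)

# Keep only the upper triangle without the diagonal, i.e. each region pair once.
rows, cols = np.triu_indices(n_regions, k=1)
features = connectivity[rows, cols]

print(connectivity.shape, features.shape)  # (64, 64) (2016,)
```

  • stacking one such vector per participant would then yield a participants x features array like the X used here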
  • at this point we already need to decide on our learning problem:

    • do we want to utilize the information we already have (labels) and thus conduct a supervised learning analysis to predict Y?

    • or do we not want to utilize the information we already have and thus conduct an unsupervised learning analysis to e.g. find clusters or decompose the data?

  • please note: we only do this for the sake of this workshop! Please never follow this type of “Hm, maybe we do this or this, let’s see how it goes.” approach in your research. Always make sure you have a precise analysis plan that is informed by prior research and guided by the possibilities of your data. Otherwise you’ll just add to the ongoing reproducibility and credibility crisis, not accelerating but hindering scientific progress. (However, the other option is that you conduct exploratory analyses and are honest about it, not acting as if they were confirmatory.)

  • that being said: we’re going to basically test all of them (talk about “not practising what one preaches”, eh?), again, solely for teaching reasons

  • we’re going to start with supervised learning, thus using the information we already have
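  • the practical difference between the two learning problems shows up directly in the fit call: a supervised model gets X and the labels, an unsupervised model only gets X. A minimal sketch on synthetic stand-in data (the toy arrays and the choice of SVC and KMeans are just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rs = np.random.RandomState(0)

# Synthetic stand-in data: 40 samples, 5 features, two shifted groups.
X_toy = np.vstack([rs.normal(-2, 1, size=(20, 5)),
                   rs.normal(2, 1, size=(20, 5))])
y_toy = np.array([0] * 20 + [1] * 20)

# Supervised: the labels are part of the fit.
clf = SVC(kernel='linear').fit(X_toy, y_toy)
predicted_labels = clf.predict(X_toy)

# Unsupervised: only X is used; cluster ids are arbitrary (0 and 1 may be swapped).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = km.labels_

print(predicted_labels.shape, cluster_ids.shape)
```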

Supervised learning

  • independent of the precise task type we want to run, we initially need to load the information, i.e. labels, available to us:

import pandas as pd
information = pd.read_csv('participants.csv')
information.head(n=5)
  participant_id    Age AgeGroup Child_Adult Gender Handedness
0   sub-pixar123  27.06    Adult       adult      F          R
1   sub-pixar124  33.44    Adult       adult      M          R
2   sub-pixar125  31.00    Adult       adult      M          R
3   sub-pixar126  19.00    Adult       adult      F          R
4   sub-pixar127  23.00    Adult       adult      F          R
  • as you can see, we have multiple variables, i.e. labels, describing our participants, i.e. samples, and almost every one of them can be used to address a supervised learning problem (e.g. Child_Adult)

Supervised learning

  • goal: Learn parameters (or weights) of a model (M) that maps X to y

  • however, while some are categorical and thus could be employed within a classification analysis, some are continuous and thus would fit within a regression analysis (e.g. Age)

  • we’re going to check both
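  • a rough way to sort the available labels into candidates for classification vs. regression is to look at their data type. The sketch below retypes the five rows shown by information.head() above; the helper names are made up, and note that a numeric dtype alone does not guarantee a truly continuous variable (e.g. integer-coded groups):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# The five rows shown by information.head(), typed in by hand.
labels = pd.DataFrame({
    'participant_id': ['sub-pixar123', 'sub-pixar124', 'sub-pixar125',
                       'sub-pixar126', 'sub-pixar127'],
    'Age': [27.06, 33.44, 31.00, 19.00, 23.00],
    'AgeGroup': ['Adult'] * 5,
    'Child_Adult': ['adult'] * 5,
    'Gender': ['F', 'M', 'M', 'F', 'F'],
    'Handedness': ['R'] * 5,
})

# The identifier column is not a label, so it is dropped first.
candidates = labels.drop(columns='participant_id')

# Numeric columns are regression candidates, the rest classification candidates.
regression_targets = [c for c in candidates.columns
                      if is_numeric_dtype(candidates[c])]
classification_targets = [c for c in candidates.columns
                          if c not in regression_targets]

print(regression_targets)
print(classification_targets)
```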

Supervised learning - classification

  • goal: Learn parameters (or weights) of a model (M) that maps X to y

  • in order to run a classification analysis, we need to obtain the correct categorical labels, defining them as our Y

Y_cat = information['Child_Adult']
Y_cat.describe()
count       155
unique        2
top       child
freq        122
Name: Child_Adult, dtype: object
  • we can see that we have two unique expressions, but let’s plot the distribution just to be sure and maybe see something important/interesting:

fig = px.histogram(Y_cat, marginal='box', template='plotly_white')

fig.update_layout(showlegend=False)
init_notebook_mode(connected=True)

#fig.show()

plot(fig, filename = 'labels.html')
display(HTML('labels.html'))
  • that looked about right and we can continue with our analysis

  • to keep things easy, we will use the same pipeline we employed in the previous section, that is we will scale our input data, train a Support Vector Machine and test its predictive performance:

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
pipe = make_pipeline(
    StandardScaler(),
    SVC()
)
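  • one reason to bundle scaling and estimator in a pipeline: the scaler's statistics are then learned from whatever data is passed to fit, i.e. the training data only. A minimal sketch on synthetic stand-in data (the toy arrays and the name toy_pipe are made up for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rs = np.random.RandomState(0)

# Synthetic stand-in data: 60 samples, 4 features, alternating labels.
X_toy = rs.normal(loc=5.0, scale=3.0, size=(60, 4))
y_toy = np.array([0, 1] * 30)

toy_pipe = make_pipeline(StandardScaler(), SVC())
toy_pipe.fit(X_toy[:40], y_toy[:40])  # fit on the "training" part only

# make_pipeline names each step after its lowercased class name.
step_names = list(toy_pipe.named_steps)
scaler = toy_pipe.named_steps['standardscaler']

print(step_names)  # ['standardscaler', 'svc']
# The scaler's means come from the first 40 rows, not from all 60.
print(np.allclose(scaler.mean_, X_toy[:40].mean(axis=0)))  # True
```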

A bit of information about Support Vector Machines:

  • non-probabilistic binary classifier

    • samples are in one of two classes

  • utilization of a hyperplane as decision boundary

    • dimensionality of the hyperplane: number of feature dimensions - 1

  • support vectors

    • small vs. large margins

import numpy as np
import plotly.graph_objects as go  # needed for go.FigureWidget below
from sklearn.svm import SVC

rs = np.random.RandomState(1234)

# Generate some fake data.
n_samples = 200
# X is the input features by row.
X = np.zeros((n_samples, 3))
X[:n_samples//2] = rs.multivariate_normal( np.ones(3), np.eye(3), size=n_samples//2)
X[n_samples//2:] = rs.multivariate_normal(-np.ones(3), np.eye(3), size=n_samples//2)
# Y is the class labels for each row of X.
Y = np.zeros(n_samples); Y[n_samples//2:] = 1

# Fit the data with an svm
svc = SVC(kernel='linear')
svc.fit(X,Y)

# The equation of the separating plane is given by all x in R^3 such that:
# np.dot(svc.coef_[0], x) + b = 0. We should solve for the last coordinate
# to plot the plane in terms of x and y.

z = lambda x,y: (-svc.intercept_[0]-svc.coef_[0][0]*x-svc.coef_[0][1]*y) / svc.coef_[0][2]

tmp = np.linspace(-2,2,51)
x,y = np.meshgrid(tmp,tmp)

# Plot stuff.
fig = go.FigureWidget()
fig.add_surface(x=x, y=y, z=z(x,y), colorscale='Greys', showscale=False)
fig.add_scatter3d(x=X[Y==0,0], y=X[Y==0,1], z=X[Y==0,2], mode='markers', marker={'color': 'blue'})
fig.add_scatter3d(x=X[Y==1,0], y=X[Y==1,1], z=X[Y==1,2], mode='markers', marker={'color': 'red'})

fig.update_layout(template='plotly_white', showlegend=False)
init_notebook_mode(connected=True)

#fig.show()

plot(fig, filename = 'svm.html')
display(HTML('svm.html'))
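  • the "small vs. large margins" point can be made concrete via the regularization parameter C of a linear SVC: a small C tolerates more margin violations and yields a wider margin, a large C a narrower one. A minimal sketch on the same kind of fake data as above (the two C values are arbitrary picks for illustration):

```python
import numpy as np
from sklearn.svm import SVC

rs = np.random.RandomState(1234)

# Two overlapping Gaussian clouds in 3D, 100 samples each.
n_samples = 200
X_toy = np.vstack([
    rs.multivariate_normal(np.ones(3), np.eye(3), size=n_samples // 2),
    rs.multivariate_normal(-np.ones(3), np.eye(3), size=n_samples // 2),
])
y_toy = np.repeat([0, 1], n_samples // 2)

margins = {}
for C in (0.01, 100.0):
    svc = SVC(kernel='linear', C=C).fit(X_toy, y_toy)
    # For a linear SVM the margin width is 2 / ||w||.
    margins[C] = 2 / np.linalg.norm(svc.coef_[0])
    print(C, len(svc.support_vectors_), round(margins[C], 2))
```

  • the printed columns are C, the number of support vectors and the margin width; the smaller C should show the wider margin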

Pros

  • effective in high dimensional spaces

    • still effective in cases where the number of dimensions is greater than the number of samples

  • uses a subset of training points in the decision function (called support vectors), so it is also memory efficient

  • versatile: different Kernel functions

Cons

  • if the number of features is much greater than the number of samples: danger of over-fitting

    • make sure to check kernel and regularization

  • SVMs do not directly provide probability estimates
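  • what "no direct probability estimates" means in practice: a fitted SVC gives signed distances to the hyperplane via decision_function. If probabilities are needed, one option is to wrap the SVC in a calibration step. A minimal sketch on synthetic stand-in data, assuming sigmoid (Platt-style) calibration with 5-fold cross-validation is acceptable for the use case:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

rs = np.random.RandomState(0)

# Synthetic stand-in data: two overlapping groups, 50 samples each.
X_toy = np.vstack([rs.normal(-1, 1, size=(50, 3)),
                   rs.normal(1, 1, size=(50, 3))])
y_toy = np.repeat([0, 1], 50)

# Plain SVC: signed distances to the hyperplane, not probabilities.
svc = SVC(kernel='linear').fit(X_toy, y_toy)
scores = svc.decision_function(X_toy[:5])

# Calibrated SVC: maps the scores to probabilities via cross-validated sigmoid fits.
calibrated = CalibratedClassifierCV(SVC(kernel='linear'), method='sigmoid', cv=5)
calibrated.fit(X_toy, y_toy)
proba = calibrated.predict_proba(X_toy[:5])

print(scores.shape, proba.shape)  # (5,) (5, 2)
```

  • the calibration adds extra model fits, so it is slower than the plain SVC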

  • before we can go further, we need to divide our input data X into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(data, Y_cat, random_state=0)
  • and we can already fit our analysis pipeline:

pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()), ('svc', SVC())])
  • followed by testing the model’s predictive performance:

# accuracy_score expects (y_true, y_pred)
print('accuracy is %s with chance level being %s' %(accuracy_score(y_test, pipe.predict(X_test)), 1/len(pd.unique(Y_cat))))
accuracy is 0.8974358974358975 with chance level being 0.5

(spoiler alert: can this be right?)
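  • one possible reason for caution: the two classes are not balanced (122 of the 155 participants are children according to Y_cat.describe()), so 0.5 may be too optimistic a reference. A minimal sketch of a majority-class baseline, rebuilt from those counts only (the 33 adults are simply 155 - 122; no brain data involved):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Label counts taken from Y_cat.describe(): 122 of 155 participants are children.
y_toy = np.array(['child'] * 122 + ['adult'] * 33)
X_dummy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_dummy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_dummy))

print(round(baseline_accuracy, 3))  # 0.787
```

  • in other words, a model that ignores X entirely and always says “child” would already be right about 79% of the time on the full sample; the exact value on a given test split depends on how many children end up in it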

Supervised learning - regression

  • after seeing that we can obtain a super high accuracy using a classification approach, we’re hooked and want to check if we can get an even better performance by addressing our learning problem via a regression approach

  • for this to work, we need to change our labels, i.e. Y from a categorical to a continuous variable:

information.head(n=5)
  participant_id    Age AgeGroup Child_Adult Gender Handedness
0   sub-pixar123  27.06    Adult       adult      F          R
1   sub-pixar124  33.44    Adult       adult      M          R
2   sub-pixar125  31.00    Adult       adult      M          R
3   sub-pixar126  19.00    Adult       adult      F          R
4   sub-pixar127  23.00    Adult       adult      F          R
  • here Age seems like a good fit:

Y_con = information['Age']
Y_con.describe()
count    155.000000
mean      10.555189
std        8.071957
min        3.518138
25%        5.300000
50%        7.680000
75%       10.975000
max       39.000000
Name: Age, dtype: float64
  • however, we are of course going to plot it again (reminder: always check your data):

fig = px.histogram(Y_con, marginal='box', template='plotly_white')

fig.update_layout(showlegend=False)
init_notebook_mode(connected=True)

#fig.show()

plot(fig, filename = 'labels.html')
display(HTML('labels.html'))
  • the only thing we need to do to change our previous analysis pipeline from a classification to a regression task is to adapt the estimator accordingly:

from sklearn.linear_model import LinearRegression
pipe = make_pipeline(
    StandardScaler(),
    LinearRegression()
)

A bit of information about regression

  • modelling the relationship between a scalar response and one or more explanatory variables

Pros

  • simple implementation, efficient & fast

  • good performance when the relationship between features and target is approximately linear

  • can address overfitting via regularization

Cons

  • prone to underfitting

  • outlier sensitivity

  • assumption of independence (of the observations, i.e. their errors)
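  • the "overfitting via regularization" point can be sketched by swapping LinearRegression for Ridge in the same kind of pipeline. The example uses synthetic stand-in data with more features than samples (as in our X); alpha=10.0 is an arbitrary pick for illustration and would normally be tuned, e.g. via cross-validation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rs = np.random.RandomState(0)

# Synthetic stand-in: fewer samples (50) than features (200).
X_toy = rs.standard_normal((50, 200))
y_toy = 2.0 * X_toy[:, 0] + rs.standard_normal(50)

ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X_toy, y_toy)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X_toy, y_toy)

# The L2 penalty shrinks the weights compared to the unregularized fit.
ols_norm = np.linalg.norm(ols.named_steps['linearregression'].coef_)
ridge_norm = np.linalg.norm(ridge.named_steps['ridge'].coef_)

print(ridge_norm < ols_norm)  # True
```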

  • the rest of the workflow is almost identical to the classification approach

  • after splitting the data into train and test sets:

X_train, X_test, y_train, y_test = train_test_split(data, Y_con, random_state=0)
  • we fit the pipeline:

pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('linearregression', LinearRegression())])
  • whose predictive performance can then be evaluated:

from sklearn.metrics import mean_absolute_error

# mean_absolute_error expects (y_true, y_pred)
print('mean absolute error in years: %s against a data distribution from %s to %s years' %(mean_absolute_error(y_test, pipe.predict(X_test)), Y_con.min(), Y_con.max()))
mean absolute error in years: 4.116128254997565 against a data distribution from 3.518138261 to 39.0 years

Question: Is this good or bad?
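  • one way to approach this question is to compare against a model that ignores X and always predicts the same age. A minimal sketch with a hypothetical helper, baseline_mae, that uses the training median as the constant prediction (the toy numbers below are made up, not the workshop ages):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error


def baseline_mae(y_train, y_test):
    """MAE of a model that always predicts the median of y_train."""
    dummy = DummyRegressor(strategy='median')
    # The dummy model ignores the features, so placeholders are enough.
    X_train = np.zeros((len(y_train), 1))
    X_test = np.zeros((len(y_test), 1))
    dummy.fit(X_train, y_train)
    return mean_absolute_error(y_test, dummy.predict(X_test))


# Toy check: the median of [1, 2, 3, 10] is 2.5, so the errors are 0.5 and 1.5.
toy_error = baseline_mae([1.0, 2.0, 3.0, 10.0], [2.0, 4.0])
print(toy_error)  # 1.0
```

  • with the split from above, baseline_mae(y_train, y_test) would give the number to compare the pipeline's mean absolute error against; given how skewed the ages are (median 7.68, max 39), such a reference seems more informative than the raw range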